In [ ]:
# Import all of the things you need to import!
In [2]:
import scipy
import sklearn
import nltk
import pandas as pd
The Congressional Record is more or less what happened in Congress every single day. Speeches and all that. A good large source of text data, maybe?
Let's pretend it's totally secret but we just got it leaked to us in a data dump, and we need to check it out. It was leaked from this page here.
In [3]:
# If you'd like to download it through the command line...
!curl -O http://www.cs.cornell.edu/home/llee/data/convote/convote_v1.1.tar.gz
In [3]:
# And then extract it through the command line...
!tar -zxf convote_v1.1.tar.gz
You can explore the files if you'd like, but we're going to use the ones in convote_v1.1/data_stage_one/development_set/. It's a bunch of text files.
In [4]:
# glob finds files matching a certain filename pattern
import glob
# Give me all the text files
paths = glob.glob('convote_v1.1/data_stage_one/development_set/*')
paths[:5]
Out[4]:
In [5]:
len(paths)
Out[5]:
So great, we have 702 of them. Now let's import them.
In [6]:
speeches = []
for path in paths:
    with open(path) as speech_file:
        speech = {
            'pathname': path,
            'filename': path.split('/')[-1],
            'content': speech_file.read()
        }
    speeches.append(speech)
speeches_df = pd.DataFrame(speeches)
speeches_df.head()
Out[6]:
In class we had the texts variable. For the homework you can just use speeches_df['content'] to get the same sort of list.
Take a look at the contents of the first 5 speeches.
In [7]:
speeches_df['content'].head(5)
Out[7]:
In [50]:
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer(max_features=100, stop_words='english')
from sklearn.feature_extraction.text import TfidfVectorizer
import re
from nltk.stem.porter import PorterStemmer
In [51]:
X = count_vectorizer.fit_transform(speeches_df['content'])
In [52]:
X.toarray()
Out[52]:
Okay, that array is far too big to even look at. Let's get the list of features from the CountVectorizer (it already keeps only the top 100 words) and use them to label the columns of a dataframe.
In [53]:
tophundred_df = pd.DataFrame(X.toarray(), columns=count_vectorizer.get_feature_names())
In [54]:
tophundred_df
Out[54]:
In [ ]:
Now let's push all of that into a dataframe with nicely named columns.
In [ ]:
Everyone seems to start their speeches with "mr chairman" - how many speeches are there total, how many don't mention "chairman", and how many mention neither "mr" nor "chairman"?
In [55]:
mrchairman_df = pd.DataFrame([tophundred_df['mr'], tophundred_df['chairman'], tophundred_df['mr'] + tophundred_df['chairman']], index=["mr", "chairman", "mr + chairman"]).T
In [56]:
mrchairman_df
Out[56]:
In [83]:
num_speeches = len(mrchairman_df)
In [87]:
mrmention_df = mrchairman_df[(mrchairman_df['mr'] > 0)]
mr_mention = len(mrmention_df)
In [88]:
mrorchairmanmention_df = mrchairman_df[(mrchairman_df['mr + chairman'] > 0)]
mrorchair_mention = len(mrorchairmanmention_df)
In [93]:
print("There are",num_speeches,"speeches. Only", num_speeches - mr_mention, "do not mention mr and ", num_speeches - mrorchair_mention, "do not mention mr or chairman")
In [ ]:
In [95]:
tophundred_df.columns
Out[95]:
What is the index of the speech that is the most thankful, a.k.a. includes the word 'thank' the most times?
In [96]:
tophundred_df['thank'].sort_values(ascending=False).head(1) # thank is not in the top 100 words unless you remove stop words
Out[96]:
If I'm searching for china and trade, what are the top 3 speeches to read according to the CountVectorizer?
In [106]:
chinatrade_df = pd.DataFrame([tophundred_df['china'] + tophundred_df['trade']], index=["China + trade"]).T
In [110]:
chinatrade_df['China + trade'].sort_values(ascending=False).head(3)
Out[110]:
Now what if I'm using a TfidfVectorizer?
In [121]:
porter_stemmer = PorterStemmer()

def stemming_tokenizer(str_input):
    # Strip punctuation, lowercase, and stem each word
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    words = [porter_stemmer.stem(word) for word in words]
    return words

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=stemming_tokenizer, use_idf=False, norm='l1', max_features=100)
X = tfidf_vectorizer.fit_transform(speeches_df['content'])
# Full term-frequency matrix with one column per (stemmed) feature
tfidf_df = pd.DataFrame(X.toarray(), columns=tfidf_vectorizer.get_feature_names())
In [125]:
# Sum the term-frequency scores for the two search terms (assumes 'china' and 'trade' made the 100-feature cut)
chinatrade_tfidfpd = pd.DataFrame([tfidf_df['china'] + tfidf_df['trade']], index=["China + trade"]).T
In [128]:
# chinatrade_tfidfpd
In [126]:
chinatrade_tfidfpd['China + trade'].sort_values(ascending=False).head(3)
Out[126]:
What's the content of the speeches? Here's a way to get them:
In [129]:
# index 0 is the first speech, which was the first one imported.
paths[0]
Out[129]:
In [130]:
# Pass that into 'cat' using { } which lets you put variables in shell commands
# that way you can pass the path to cat
!cat {paths[0]}
Now search for something else! Another two terms that might show up, like elections and chaos? Whatever you think might be interesting.
In [ ]:
In [ ]:
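Here's one way you might go about it (just a sketch: 'elections' and 'chaos' are the example terms from the prompt, and either one might not have survived the 100-feature cut, so check before summing):
In [ ]:
# Sum the counts for a pair of search terms, skipping any term
# that didn't make it into the top 100 features
search_terms = ['elections', 'chaos']
available = [term for term in search_terms if term in tophundred_df.columns]
if available:
    print(tophundred_df[available].sum(axis=1).sort_values(ascending=False).head(3))
else:
    print("Neither term is in the top 100 features - try other words or raise max_features")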
Using a simple counting vectorizer, cluster the documents into eight categories, telling me what the top terms are per category.
Using a term frequency vectorizer, cluster the documents into eight categories, telling me what the top terms are per category.
Using a term frequency inverse document frequency vectorizer, cluster the documents into eight categories, telling me what the top terms are per category.
In [152]:
# Initialize a vectorizer (max_features=8 keeps only the 8 most frequent terms as features)
vectorizer = TfidfVectorizer(use_idf=True, tokenizer=stemming_tokenizer, stop_words='english', max_features=8)
X = vectorizer.fit_transform(speeches_df['content'])
In [153]:
X
Out[153]:
In [154]:
pd.DataFrame(X.toarray())
Out[154]:
In [155]:
from sklearn.cluster import KMeans
number_of_clusters = 8
km = KMeans(n_clusters=number_of_clusters)
km.fit(X)
Out[155]:
In [156]:
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(number_of_clusters):
top_ten_words = [terms[ind] for ind in order_centroids[i, :5]]
print("Cluster {}: {}".format(i, ' '.join(top_ten_words)))
In [157]:
km.labels_
Out[157]:
In [159]:
speeches_df['content']
Out[159]:
In [160]:
results = pd.DataFrame()
results['content'] = speeches_df['content']
results['category'] = km.labels_
results
Out[160]:
In [161]:
vectorizer.get_feature_names()
Out[161]:
In [ ]:
In [ ]:
In [167]:
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
df
Out[167]:
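The prompt also asks for the same clustering with a simple counting vectorizer and with a plain term-frequency vectorizer. Here's a sketch that repeats the steps above for those two (cluster_and_report is a helper made up for this sketch, and max_features=100 is just a guess at a reasonable vocabulary size):
In [ ]:
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def cluster_and_report(vec, texts, n_clusters=8, n_top_words=5):
    # Vectorize, cluster with KMeans, and print the top terms per cluster
    X = vec.fit_transform(texts)
    km = KMeans(n_clusters=n_clusters)
    km.fit(X)
    terms = vec.get_feature_names()
    order_centroids = km.cluster_centers_.argsort()[:, ::-1]
    for i in range(n_clusters):
        top_words = [terms[ind] for ind in order_centroids[i, :n_top_words]]
        print("Cluster {}: {}".format(i, ' '.join(top_words)))
    return km

# 1. Simple counting vectorizer
count_km = cluster_and_report(CountVectorizer(stop_words='english', max_features=100), speeches_df['content'])

# 2. Term frequency vectorizer (tf only: use_idf=False)
tf_km = cluster_and_report(TfidfVectorizer(stop_words='english', tokenizer=stemming_tokenizer, use_idf=False, norm='l1', max_features=100), speeches_df['content'])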
In [ ]:
Which one do you think works the best?
In [ ]:
In [ ]:
In [ ]:
I have a scraped collection of Harry Potter fanfiction at https://github.com/ledeprogram/courses/raw/master/algorithms/data/hp.zip.
I want you to read them in, vectorize them and cluster them. Use this process to find out the two types of Harry Potter fanfiction. What is your hypothesis?
In [29]:
!curl -LO https://github.com/ledeprogram/courses/raw/master/algorithms/data/hp.zip
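One possible way to start (a sketch, not a worked answer: I haven't looked inside hp.zip, so the file layout, the glob pattern, and the max_features=1000 choice are all assumptions to adjust once the archive is extracted):
In [ ]:
import glob
import os
import zipfile

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Extract the archive and read every file it contains
with zipfile.ZipFile('hp.zip') as zf:
    zf.extractall('hp')

hp_paths = [p for p in glob.glob('hp/**/*', recursive=True) if os.path.isfile(p)]

stories = []
for path in hp_paths:
    with open(path, encoding='utf-8', errors='ignore') as f:
        stories.append(f.read())

# Vectorize and split into two clusters, since we're after two types of fanfiction
hp_vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
hp_X = hp_vectorizer.fit_transform(stories)
hp_km = KMeans(n_clusters=2)
hp_km.fit(hp_X)

# Top terms per cluster, same trick as with the speeches
hp_terms = hp_vectorizer.get_feature_names()
order_centroids = hp_km.cluster_centers_.argsort()[:, ::-1]
for i in range(2):
    print("Cluster {}: {}".format(i, ' '.join(hp_terms[ind] for ind in order_centroids[i, :10])))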
In [ ]:
In [ ]: